Analyzing Jokerace Submissions: Vision, Mission, and Strategic Priorities¶

Author: Omniacs.DAO

Case Study: From Raw Text to Actionable Insights with Python and OpenAI¶

How can you make sense of thousands of free-text submissions from the Arbitrum community? This is a common challenge for decentralized autonomous organizations (DAOs), companies, and projects that rely on community feedback to guide their strategy. Raw text is messy, unstructured, and time-consuming to analyze manually.

In this case study, we'll walk through a complete, real-world workflow for analyzing community proposal submissions from the Arbitrum DAO's Jokerace contests. We will use a powerful combination of classic NLP techniques and modern AI to extract meaningful themes and generate a high-level summary.

You will learn how to:

  1. Preprocess and Clean raw text data for analysis.
  2. Use N-gram Analysis to find common multi-word phrases.
  3. Apply Topic Modeling (LDA) to discover latent themes in the submissions.
  4. Leverage the OpenAI API to synthesize your findings into a concise, human-readable report.

1. Project Overview and Setup¶

Objective

Plurality Labs, founded by @DisruptionJoe and supported by his teammates @prose11 | GovAlpha and @radioflyerMQ, was tasked with setting up experimental governance mechanisms that allow for formalized "sense-making". The aim was to help the DAO articulate a strong mission and vision to guide its decisions before initiating the capital allocation process, through which grants will be given out to support the ARB ecosystem.

One of those experiments was an incentivized online survey hosted on Jokerace as part of the ThanksARB initiative during Arbitrum's "#GOVMonth". It consisted of four separate on-chain contests where users were prompted to provide feedback on:

  • ArbitrumDAO Short-term Strategic Priorities (Reduce Friction) - 0xbf47BDA4b172daf321148197700cBED04dbe0D58
  • ArbitrumDAO Long-term Strategic Priorities (Growth and Innovation) - 0x5D4e25fA847430Bf1974637f5bA8CB09D0B94ec7
  • ArbitrumDAO Northstar Strategic Framework: Vision - 0x0d4C05e4BaE5eE625aADC35479Cc0b140DDF95D4
  • ArbitrumDAO Northstar Strategic Framework: Mission - 0x5a207fA8e1136303Fd5232E200Ca30042c45c3B6

The goal of this case study is to analyze submissions across the four Jokerace contests (Vision, Mission, Short-term Priorities, and Long-term Priorities) to:

  1. Identify key themes, concerns, and strategic ideas proposed by the community.
  2. Develop a reusable methodology for future text analysis.
  3. Create a summarized report of the findings.

Python Environment Setup

First, let's set up our Python environment by loading the necessary libraries. These packages cover everything from data manipulation (pandas) and text mining (nltk, scikit-learn) to API communication (requests).

In [22]:
# --- Initial Setup ---
import pandas as pd
import numpy as np
import re
import string
import os
import json
import requests
from datetime import date

# For text mining and NLP
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import nltk
from nltk.stem import WordNetLemmatizer


# Download necessary NLTK data (if not already downloaded)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

# Display options for pandas DataFrames
pd.set_option('display.max_colwidth', 100)
pd.options.display.float_format = '{:.4f}'.format

Loading the Data

We'll use one main dataset:

propfinal: Contains the core proposal submissions from Jokerace, including the text content, contract address, and timestamp. It is loaded from a .csv file using pandas.

In [23]:
# Load data from .csv file
data_path = "../data/DataProposalsJokerAce.csv"
propfinal = pd.read_csv(data_path)

# Display the first few rows of the DataFrame
propfinal.head()
Out[23]:
BlockTime Contract Address Slug URL IsImage Content ContentParsed
0 2023-09-06T09:12:57Z 0xbf47bda4b172daf321148197700cbed04dbe0d58 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 5877857128534105584654838688249472014792481672535544401662666939243866098045 https://jokerace.xyz/_next/data/k6jlVObdmnNyS2qiJRXS4/en-US/contest/arbitrumone/0xbf47bda4b172da... False <p>Arbitrum DAO's short term is to enhance accessibility, user experience, and community support... Arbitrum DAO's short term is to enhance accessibility, user experience, and community support
1 2023-09-06T09:13:36Z 0x5d4e25fa847430bf1974637f5ba8cb09d0b94ec7 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 32714819087882020631392000763036800356962515213469797779328175746480751433321 https://jokerace.xyz/_next/data/k6jlVObdmnNyS2qiJRXS4/en-US/contest/arbitrumone/0x5d4e25fa847430... False <p>Arbitrum DAO's long term is to establish a robust, secure, and globally recognized blockchain... Arbitrum DAO's long term is to establish a robust, secure, and globally recognized blockchain ec...
2 2023-09-06T09:14:02Z 0x0d4c05e4bae5ee625aadc35479cc0b140ddf95d4 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 79434989857275367523553028190443079857085132323129240161246620406681346326246 https://jokerace.xyz/_next/data/k6jlVObdmnNyS2qiJRXS4/en-US/contest/arbitrumone/0x0d4c05e4bae5ee... False <p>Arbitrum DAO's vision is to revolutionize blockchain adoption and transform global finance th... Arbitrum DAO's vision is to revolutionize blockchain adoption and transform global finance throu...
3 2023-09-06T09:14:34Z 0x5a207fa8e1136303fd5232e200ca30042c45c3b6 0x6899dd4c3ceab23e7022ff5ead9b9e9c78333105 22234905525497835912853491622795712327417602772596009688197720573731403513863 https://jokerace.xyz/_next/data/k6jlVObdmnNyS2qiJRXS4/en-US/contest/arbitrumone/0x5a207fa8e11363... False <p>Arbitrum DAO's mission is to drive innovation, foster collaboration, and empower individuals ... Arbitrum DAO's mission is to drive innovation, foster collaboration, and empower individuals in ...
4 2023-09-09T11:34:17Z 0x5a207fa8e1136303fd5232e200ca30042c45c3b6 0x88496a9b5d7543148d9f8aecc9f69cb8437a4bae 58550340301071742211754996440353015903983381407702370649470310016744345829658 https://jokerace.xyz/_next/data/k6jlVObdmnNyS2qiJRXS4/en-US/contest/arbitrumone/0x5a207fa8e11363... False <p>cxvxcvfds</p> cxvxcvfds

2. Data Cleaning and Preprocessing¶

Text data from the wild is messy: it's filled with HTML tags, punctuation, and inconsistent capitalization. To prepare it for analysis, we need to standardize it through a process called preprocessing. Our pipeline will:

  1. Start from ContentParsed, where HTML tags have already been stripped.
  2. Convert all text to lowercase.
  3. Remove punctuation.
  4. Stem words to their root form (e.g., "running" and "runs" both become "run"). This helps group related words.
  5. Trim excess whitespace.
In [24]:
# --- Clean and Preprocess Text ---

# Initialize the stemmer
stemmer = PorterStemmer()

def clean_stem(text):
    # Lowercase
    text = text.lower()
    # Drop possessive 's before stripping punctuation
    text = re.sub(r"'s\b", "", text)
    # Replace remaining punctuation with spaces
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Stem each word to its root form
    words = [stemmer.stem(w) for w in text.split()]
    return ' '.join(words)

# Apply to the parsed submission text
propfinal_clean = propfinal['ContentParsed'].apply(clean_stem)

# Display a before-and-after comparison
df_preview = pd.DataFrame({
    'Raw': propfinal['Content'].head(),
    'Cleaned': propfinal_clean.head()
})

df_preview
Out[24]:
Raw Cleaned
0 <p>Arbitrum DAO's short term is to enhance accessibility, user experience, and community support... arbitrum dao short term is to enhanc access user experi and commun support
1 <p>Arbitrum DAO's long term is to establish a robust, secure, and globally recognized blockchain... arbitrum dao long term is to establish a robust secur and global recogn blockchain ecosystem
2 <p>Arbitrum DAO's vision is to revolutionize blockchain adoption and transform global finance th... arbitrum dao vision is to revolution blockchain adopt and transform global financ through decent...
3 <p>Arbitrum DAO's mission is to drive innovation, foster collaboration, and empower individuals ... arbitrum dao mission is to drive innov foster collabor and empow individu in the decentr landscap
4 <p>cxvxcvfds</p> cxvxcvfd

This clean, stemmed text is now ready for more advanced analysis.


3. Exploratory Analysis: N-grams¶

What are the most common phrases in the submissions? While looking at single words (unigrams) is useful, n-grams (sequences of n words) give us more context. For example, the trigram "reduce transaction fees" is far more insightful than the individual words "reduce," "transaction," and "fees."

Let's find the most common trigrams (3-word phrases), excluding common "stop words" like "the," "a," and "is". We'll also add custom stop words specific to our dataset, like "arbitrum" and "dao," to filter out noise.
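Conceptually, extracting trigrams needs very little machinery: a sliding three-word window plus a counter. A standalone sketch (toy sentence, not part of the notebook's pipeline):

```python
from collections import Counter

def trigrams(text):
    # Slide a 3-word window across the whitespace-tokenized text
    words = text.lower().split()
    return [" ".join(t) for t in zip(words, words[1:], words[2:])]

sample = "reduce transaction fees please reduce transaction fees now"
print(Counter(trigrams(sample)).most_common(2))
```

CountVectorizer with `ngram_range=(3, 3)` does the same thing at scale, with tokenization and stop-word filtering built in.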

In [25]:
# --- N-gram Analysis ---

# Load a standard list of English stopwords from a reliable source
try:
    stopwords_url = "https://slcladal.github.io/resources/stopwords_en.txt"
    response = requests.get(stopwords_url)
    response.raise_for_status() # Raise an exception for bad status codes
    english_stopwords = response.text.splitlines()
except requests.exceptions.RequestException as e:
    print(f"Failed to download stopwords, using NLTK's list: {e}")
    from nltk.corpus import stopwords
    english_stopwords = stopwords.words('english')

# Add custom stopwords
custom_stopwords = ["arbitrumdao", "arbitrum", "project", "arb", "dao"]
stop_words = list(set(english_stopwords + custom_stopwords))

# Use CountVectorizer to get trigrams, removing stopwords
vectorizer = CountVectorizer(ngram_range=(3, 3), stop_words=stop_words)
X = vectorizer.fit_transform(propfinal_clean.dropna())

# Sum the counts of each trigram
trigram_counts = X.sum(axis=0)
trigram_freq = [(word, trigram_counts[0, idx]) for word, idx in vectorizer.vocabulary_.items()]

# Sort the trigrams by frequency
trigram_freq = sorted(trigram_freq, key=lambda x: x[1], reverse=True)

# Create a DataFrame for display
trigram_table = pd.DataFrame(trigram_freq, columns=['Trigram', 'n'])

# Display the top 30 most frequent trigrams
print("Top 30 Trigrams:")
trigram_table.head(30)
C:\Users\Business\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ain', 'aren', 'couldn', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'll', 'mon', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with '
Top 30 Trigrams:
Out[25]:
Trigram n
0 make wise decis 212
1 believ make wise 211
2 agre becaus believ 207
3 becaus believ make 206
4 agre make develop 152
5 read agre posit 136
6 agre posit thing 131
7 layer scale solut 122
8 creat anti sybil 100
9 shortterm strateg prioriti 96
10 longterm strateg prioriti 86
11 thi fantast great 86
12 glad part thi 79
13 ensur longterm success 78
14 support make thi 77
15 make thi envi 77
16 thi envi crypto 77
17 envi crypto world 77
18 team dedic ingenu 76
19 fan support make 76
20 regular secur audit 74
21 decentr financ defi 73
22 pain point barrier 70
23 vision mission valu 70
24 fantast great potentialth 69
25 great potentialth team 69
26 potentialth team dedic 69
27 hope airdrop histori 66
28 mission creat decentr 66
29 dedic ingenu excel 65

This output immediately gives us a feel for recurring suggestions and ideas within the community.


4. Uncovering Themes with Topic Modeling (LDA)¶

While n-grams show us popular phrases, topic modeling helps us discover the underlying, latent themes across all submissions. We'll use Latent Dirichlet Allocation (LDA), a popular unsupervised algorithm that works by:

  • Assuming each document is a mix of topics.
  • Assuming each topic is a mix of words.
  • Figuring out the "topics" (which are just clusters of words) that best explain the collection of documents.
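The first assumption is visible directly in a fitted model: transform() returns, for every document, a probability distribution over topics. A tiny self-contained sketch (toy sentences, not the contest data):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus -- hypothetical sentences, not actual Jokerace submissions
docs = [
    "fees fees gas wallet user",
    "grants builder ecosystem funding",
    "fees gas grants funding",
]
dtm = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
doc_topics = lda.transform(dtm)  # shape: (n_docs, n_topics)

# Each row is one document's mixture of topics and sums to 1
print(np.round(doc_topics, 2))
print(doc_topics.sum(axis=1))
```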

The LDA Workflow

We'll create a repeatable workflow to process each contest's submissions. We'll use scikit-learn's CountVectorizer to create the Document-Term Matrix (DTM) and LatentDirichletAllocation to perform the topic modeling.

Set Up and Run the Analysis Loop

Now, we define the contests we want to analyze and the number of topics (K) we want to find for each. Choosing K is a mix of art and science; it often requires experimentation to find a number that produces coherent, distinct topics.
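One rough quantitative aid for choosing K is to fit models for several candidate values and compare perplexity (lower is generally better). A sketch using stand-in documents; in practice you would pass a contest's cleaned ContentParsed column, hold out documents for scoring, and let topic coherence (do the top words read as one theme?) have the final say:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Stand-in documents, repeated so the model has something to fit
docs = [
    "reduce gas fees improve user experience",
    "lower transaction fees for every user",
    "fund developer grants to grow the ecosystem",
    "support builders with ecosystem grants",
    "improve governance participation and voting",
    "make voting and governance easier for token holders",
] * 5

dtm = CountVectorizer().fit_transform(docs)

# Compare perplexity across candidate values of K
perplexities = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=1234).fit(dtm)
    perplexities[k] = lda.perplexity(dtm)

print(perplexities)
```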

The loop will perform the following for each contest:

  1. Subset the proposals for that contest.
  2. Perform final cleaning and filtering.
  3. Create a Document-Term Matrix (DTM) using CountVectorizer.
  4. Run the LDA model using LatentDirichletAllocation.
  5. Assign the most likely topic to each proposal.
  6. Extract the top words for each discovered topic.
In [26]:
# --- Define Contests and K (Number of Topics) ---
contest_data = [
  {"Address": "0xbf47bda4b172daf321148197700cbed04dbe0d58", "Name": "Reduce Friction", "Slug": "RF4", "Topics": 4},
  {"Address": "0x5d4e25fa847430bf1974637f5ba8cb09d0b94ec7", "Name": "Growth and Innovation", "Slug": "GI5", "Topics": 5},
  {"Address": "0x0d4c05e4bae5ee625aadc35479cc0b140ddf95d4", "Name": "Vision", "Slug": "V7", "Topics": 7},
  {"Address": "0x0d4c05e4bae5ee625aadc35479cc0b140ddf95d4", "Name": "Vision", "Slug": "V10", "Topics": 10},
  {"Address": "0x5a207fa8e1136303fd5232e200ca30042c45c3b6", "Name": "Mission", "Slug": "M11", "Topics": 11},
  {"Address": "0x5a207fa8e1136303fd5232e200ca30042c45c3b6", "Name": "Mission", "Slug": "M15", "Topics": 15}
]
contestdf = pd.DataFrame(contest_data)

# Create a directory to store outputs
output_dir = "../project1/temp/txts"
for slug in contestdf['Slug']:
    os.makedirs(os.path.join(output_dir, slug), exist_ok=True)

# --- Main LDA Analysis Loop ---
results_list = []
propout = propfinal.copy()

for idx, contest in contestdf.iterrows():
    # --- 1. Subset text for this contest address ---
    subset = propfinal[propfinal['Contract'] == contest['Address']].copy()
    subset['original_idx'] = subset.index
    
    # --- 2. Clean and filter ---
    # Drop very short submissions and exact duplicates
    subset = subset[subset['ContentParsed'].str.len() > 40]
    subset = subset.drop_duplicates(subset=['ContentParsed'], keep='first')

    if len(subset) < contest['Topics']:
        print(f"Skipping {contest['Name']} ({contest['Slug']}): Not enough documents.")
        continue

    textdata = subset['ContentParsed']

    # --- 3. DTM using CountVectorizer ---
    # Stemming and tokenization are applied at vectorization time
    stemmer = PorterStemmer()
    def stem_and_tokenize(text):
        tokens = word_tokenize(text.lower())
        return [stemmer.stem(i) for i in tokens]

    vectorizer = CountVectorizer(
        tokenizer=stem_and_tokenize,
        stop_words=stop_words,
        min_df=5  # keep terms appearing in at least 5 documents
    )
    try:
        dtm = vectorizer.fit_transform(textdata)
    except ValueError:
        print(f"Skipping {contest['Name']} ({contest['Slug']}): Could not create DTM, likely due to all terms being stopwords.")
        continue

    # --- 4. Topic modeling ---
    K = contest['Topics']
    lda = LatentDirichletAllocation(n_components=K, random_state=1234, n_jobs=-1, learning_method='batch')
    lda.fit(dtm)

    # --- 5. Assign topics and annotate main dataframe ---
    topic_results = lda.transform(dtm)
    subset['Topic'] = topic_results.argmax(axis=1) + 1 # Add 1 to be 1-indexed like R
    propout.loc[subset['original_idx'], contest['Slug']] = subset['Topic']
    
    # --- 6. Save per-topic submissions ---
    for tidx in range(1, K + 1):
        topic_docs = subset[subset['Topic'] == tidx]
        output_path = os.path.join(output_dir, contest['Slug'], f"Topic-{tidx}.txt")
        with open(output_path, 'w', encoding='utf-8') as f:
            for i, row in topic_docs.iterrows():
                f.write(f"Submission : {i}\n{row['ContentParsed']}\n\n\n")

    # --- 7. Save top terms ---
    feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
    topic_terms = {}
    for topic_idx, topic in enumerate(lda.components_):
        top_terms_idx = topic.argsort()[:-50 - 1:-1]
        top_terms = [feature_names[i] for i in top_terms_idx]
        topic_terms[f"{contest['Slug']}_Topic_{topic_idx + 1}"] = top_terms
    results_list.append(pd.DataFrame(topic_terms))
    print(f"Finished: {contest['Name']} ({contest['Slug']})")

# Combine all results
if results_list:
    results = pd.concat(results_list, axis=1)
    results.insert(0, 'TermIdx', range(1, len(results) + 1))
else:
    results = pd.DataFrame({'TermIdx': range(1, 51)})
C:\Users\Business\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:388: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'ll", "'m", "'re", "'s", "'ve", 'abl', 'abov', 'accord', 'accordingli', 'actual', 'afterward', 'ai', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anybodi', 'anyon', 'anyth', 'anywher', 'appreci', 'appropri', 'asid', 'associ', 'avail', 'aw', 'becam', 'becaus', 'becom', 'befor', 'believ', 'besid', 'ca', 'caus', 'certainli', 'chang', 'clearli', 'concern', 'consequ', 'consid', 'correspond', 'cours', 'current', 'definit', 'describ', 'despit', 'differ', 'doe', 'downward', 'dure', 'els', 'elsewher', 'entir', 'especi', 'everi', 'everybodi', 'everyon', 'everyth', 'everywher', 'exactli', 'exampl', 'follow', 'formerli', 'furthermor', 'give', 'goe', 'ha', 'happen', 'hardli', 'henc', 'hereaft', 'herebi', 'hope', 'howev', 'ignor', 'immedi', 'inde', 'indic', 'late', 'latterli', 'littl', 'mainli', 'mani', 'mayb', 'meanwhil', 'mere', 'moreov', 'mostli', "n't", 'nearli', 'necessari', 'nobodi', 'noon', 'normal', 'noth', 'nowher', 'obvious', 'onc', 'onli', 'otherwis', 'ourselv', 'outsid', 'overal', 'particularli', 'perhap', 'place', 'pleas', 'plu', 'possibl', 'presum', 'probabl', 'provid', 'quit', 'realli', 'reason', 'regard', 'rel', 'respect', 'secondli', 'selv', 'sensibl', 'seriou', 'sever', 'sinc', 'somebodi', 'someon', 'someth', 'sometim', 'somewher', 'sorri', 'specifi', 'tend', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'thi', 'thoroughli', 'thu', 'togeth', 'tri', 'truli', 'unfortun', 'unlik', 'usual', 'valu', 'variou', 'veri', 'wa', 'welcom', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'wo', 'ye', 'yourselv'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with '
Finished: Reduce Friction (RF4)
Finished: Growth and Innovation (GI5)
Finished: Vision (V7)
Finished: Vision (V10)
Finished: Mission (M11)
Finished: Mission (M15)

By looking at the word clusters for each topic (e.g., RF4_Topic_1), we can assign a human-readable label. For instance, a topic with words like "communiti," "blockchain," "develop," "secur," and "support" is possibly about Governance.


5. Synthesizing Insights with the OpenAI API¶

We now have structured data: n-grams and topics. But we still need to produce a high-level summary. This is where an LLM like GPT-4o mini shines. We will ask the AI to act as an analyst, providing it with our topic keywords and the raw submissions, and instructing it to generate a synthesized report.

Step 1: Prepare the Prompt

The key to getting a good result from an LLM is a well-structured prompt. Our prompt will:

  1. Assign a role: "You are an expert blockchain community analyst."
  2. Provide context: Give it the top keywords from our LDA model.
  3. Provide the data: Give it the raw text submissions.
  4. Give clear instructions: Ask it to identify themes and find standout submissions.
  5. Specify the output format: Request clear headings for easy reading.
In [29]:
# --- OpenAI Summarization ---

# Let's focus on the "Reduce Friction" contest (RF4)
slug_to_summarize = 'RF4'
num_topics = contestdf.loc[contestdf['Slug'] == slug_to_summarize, 'Topics'].iloc[0]

# Extract top words for the contest
rf4_topic_cols = [col for col in results.columns if col.startswith(f'{slug_to_summarize}_Topic_')]
top_words_series = results[rf4_topic_cols].stack()
top_words = top_words_series.unique()

# Load all submissions for the contest from the saved text files
submissions = []
for i in range(1, num_topics + 1):
    file_path = os.path.join(output_dir, slug_to_summarize, f'Topic-{i}.txt')
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            submissions.extend(f.readlines())
    except FileNotFoundError:
        print(f"Warning: File not found {file_path}")
        
# Combine the submissions and keywords into the prompt inputs
submissions_text = '\n'.join(submissions)
keywords_text = ', '.join(top_words)

prompt = f"""
You are an expert blockchain community analyst. Given the following user submissions and most common keywords, create a detailed summary for each of these categories:
1. Reduce Friction (overall)
2. Growth and Innovation
3. Vision
4. Mission

For each category, do the following:
- Identify and describe the main themes, concerns, and opportunities found in the submissions.
- Provide at least 2–3 representative or standout submissions (either as direct quotes or short paraphrases) that best capture the spirit or key insights of that category.
- Where relevant, highlight points of consensus, strong opinions, or recurring challenges and suggestions.

Most common keywords:
{keywords_text}

User submissions:
{submissions_text}

Format your output with clear headings for each category, and organize within each section as:
- Themes:
  [Detailed synthesis]
- Standout submissions:
  - [Quote or paraphrase 1]
  - [Quote or paraphrase 2]
  - [Quote or paraphrase 3]

If a submission is relevant to more than one category, you may include it in multiple sections.
"""
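Concatenating every submission into one prompt can exceed the model's context window. As a rough guard, the sketch below truncates the text at a line boundary using a character budget (an assumption; for exact token counts you could use a tokenizer library such as tiktoken):

```python
MAX_PROMPT_CHARS = 100_000  # hypothetical budget; tune for your model

def truncate_to_budget(text: str, budget: int = MAX_PROMPT_CHARS) -> str:
    """Trim text to a character budget, cutting only at line boundaries."""
    if len(text) <= budget:
        return text
    # Drop the last (possibly partial) line so we never cut mid-submission
    return text[:budget].rsplit('\n', 1)[0]

sample = "first submission\nsecond submission\nthird submission"
print(truncate_to_budget(sample, 25))  # keeps only whole lines
```

In the notebook you would apply this to `submissions_text` before assembling the prompt.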

Step 2: Make the API Call

Next, we send our prompt to the OpenAI API using the requests package. For this exercise, you will need an API key from OpenAI, and you must prepay for API credits. We use gpt-4o-mini here, but OpenAI offers many other models.

Important: Never hardcode your API key directly in your script. For ease of use, load it from a separate, secure file (that you don't share or commit to version control). For a more advanced secure approach, store your API key as a system environment variable.
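As a minimal sketch of the environment-variable approach (assuming you have already run something like `export OPENAI_API_KEY=...` in your shell):

```python
import os

# Read the key from the environment rather than from a file on disk
openai_api_key = os.environ.get('OPENAI_API_KEY')

if openai_api_key is None:
    print("OPENAI_API_KEY is not set; falling back to the key file below.")
```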

In [35]:
# --- Call OpenAI API ---
import json

import requests
from IPython.display import display, Markdown

# Load your API key securely from a file
try:
    with open('../../openai_key.txt', 'r') as f:
        openai_api_key = f.read().strip()
except FileNotFoundError:
    openai_api_key = None
    print("OpenAI API key file not found. Skipping API call.")

if openai_api_key:
    headers = {
        'Authorization': f'Bearer {openai_api_key}',
        'Content-Type': 'application/json'
    }
    
    payload = {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.5,
        "max_tokens": 1500
    }

    try:
        response = requests.post(
            'https://api.openai.com/v1/chat/completions', 
            headers=headers, 
            data=json.dumps(payload)
        )
        response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
        
        result = response.json()
        content = result['choices'][0]['message']['content']
        
        # Display the formatted output
        print(content)
        
    except requests.exceptions.RequestException as e:
        # Note: a requests Response is falsy for 4xx/5xx status codes,
        # so compare against None explicitly rather than relying on truthiness
        error_content = e.response.text if e.response is not None else str(e)
        display(Markdown(f'<div class="output-box">Error calling OpenAI API:\n{error_content}</div>'))
# Summary of User Submissions by Category

## 1. Reduce Friction (Overall)
### Themes:
The submissions highlight a strong consensus on the need to enhance user experience, streamline onboarding processes, and improve communication within the Arbitrum ecosystem. Key concerns include high gas fees, complex user interfaces, and the necessity for robust developer support. Many users emphasized the importance of transparency in governance and the need for effective community engagement to foster trust and participation.

### Standout Submissions:
- **"ArbitrumDAO must urgently address immediate friction, pain points, and barriers within our ecosystem. Prioritizing user-centricity and reducing obstacles for our community members, developers, and critical stakeholders is essential."**
- **"User onboarding simplification is paramount. Implement user-friendly guides and resources to ensure a seamless transition into our ecosystem."**
- **"Streamline governance processes to enhance decision-making efficiency while maintaining transparency and decentralization."**

## 2. Growth and Innovation
### Themes:
Many submissions focused on fostering innovation through community engagement, developer support, and strategic partnerships. Users expressed a desire for educational initiatives that would empower both developers and users. Several submissions emphasized the importance of creating a vibrant ecosystem through collaborations with other projects and enhancing interoperability.

### Standout Submissions:
- **"Encourage innovation and development on the Arbitrum platform by offering grants, developer tools, and documentation to simplify the onboarding process."**
- **"Foster strategic partnerships with other projects to expand the utility and reach of Arbitrum, creating a network effect."**
- **"Invest in educational resources and campaigns to help users understand the benefits of Arbitrum and how to use it effectively."**

## 3. Vision
### Themes:
The vision articulated in the submissions emphasizes creating a decentralized, user-centric, and transparent ecosystem. Many users expressed a desire for ArbitrumDAO to uphold its core values while navigating the challenges of the blockchain landscape. There is a strong call for community involvement in governance and decision-making processes, reflecting a commitment to inclusivity.

### Standout Submissions:
- **"Our vision for ArbitrumDAO is clear: to become a beacon of decentralized innovation and collaboration in the blockchain ecosystem."**
- **"Community is the key! DAO needs to organize more award-winning initiatives to foster engagement and participation."**
- **"Transparency and participation are essential for building trust within the community. Regularly communicate updates and encourage participation in governance."**

## 4. Mission
### Themes:
The mission articulated by users is to empower the community through education, transparency, and active participation in governance. There is a strong emphasis on creating a supportive environment for both users and developers, ensuring that the DAO aligns its actions with its stated values. The mission also encompasses enhancing security and addressing regulatory concerns.

### Standout Submissions:
- **"Our mission is to empower the global community to shape the future of decentralized finance, governance, and beyond."**
- **"To ensure long-term success, ArbitrumDAO should focus on addressing high gas fees, simplifying user onboarding, and supporting developers."**
- **"Fostering an active and engaged community is crucial for the long-term success of any project."**

---

This structured summary captures the main themes, concerns, and opportunities identified in user submissions related to reducing friction, fostering growth and innovation, articulating a clear vision, and establishing a mission for ArbitrumDAO.
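One caveat on the API call above: a single requests.post fails outright on transient errors such as rate limits (HTTP 429) or server errors. A small retry wrapper with exponential backoff makes the pipeline more robust; the attempt count and delays below are illustrative assumptions, not official OpenAI guidance:

```python
import time

import requests

def post_with_retry(url, headers, payload, attempts=3, backoff=2.0):
    """POST with exponential backoff on transient failures (429/5xx)."""
    for attempt in range(attempts):
        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=60)
            if resp.status_code in (429, 500, 502, 503):
                resp.raise_for_status()  # convert to an HTTPError we can retry
            return resp
        except requests.exceptions.RequestException:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))
```

You would then call `post_with_retry('https://api.openai.com/v1/chat/completions', headers, payload)` in place of the bare `requests.post`.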
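Printing the summary works interactively, but for sharing it with the DAO you will likely want to persist it. A minimal sketch (the filename is hypothetical, and `summary_text` stands in for the `content` variable from the API call):

```python
from pathlib import Path

# Stand-in for the `content` string returned by the API call
summary_text = "# Summary of User Submissions by Category\n\n..."

report_path = Path('RF4_summary.md')  # hypothetical output location
report_path.write_text(summary_text, encoding='utf-8')
print(f"Wrote {len(summary_text)} characters to {report_path}")
```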

6. Conclusion and Recommendations¶

By combining traditional NLP techniques with the power of modern LLMs, we have moved from thousands of raw text submissions to a concise, actionable summary of community feedback.

Key Findings

  • Identified Core Themes: Using LDA, we successfully identified and categorized the main topics of conversation, such as governance, onboarding, and developer tooling.
  • Pinpointed Specific Suggestions: N-gram analysis helped highlight specific, recurring phrases and proposals.
  • Synthesized Actionable Insights: The OpenAI API provided a high-quality narrative summary, saving hours of manual reading and interpretation.

Recommendations for Future Analysis

  • Improve Prompting: To get more structured data from the community, design submission forms with clearer, more focused prompts.
  • Automate Filtering: Use regular expressions and readability metrics to automatically filter out low-quality or spam submissions.
  • Integrate Voting Data: Correlate the topics of proposals with their voting outcomes to see which ideas gained the most traction.
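As a starting point for the automated-filtering recommendation above, here is a minimal heuristic sketch; the regexes and thresholds are illustrative assumptions, not values tuned on real Jokerace data:

```python
import re

def looks_like_spam(text: str) -> bool:
    """Flag likely low-quality submissions (illustrative thresholds)."""
    stripped = text.strip()
    if len(stripped) < 20:  # too short to be a substantive proposal
        return True
    if re.search(r'(https?://\S+\s*){3,}', stripped):  # link dumps
        return True
    letters = re.findall(r'[A-Za-z]', stripped)
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.7:
        return True  # mostly all-caps "shouting"
    return False

submissions = [
    "gm",
    "Reduce gas fees and simplify onboarding for new Arbitrum users.",
]
print([looks_like_spam(s) for s in submissions])  # → [True, False]
```

Combining a filter like this with a readability metric (e.g. from the textstat package) would let you drop spam before the N-gram and LDA steps.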